Research Question
Is there a relationship between mode of transportation to travel to work and different health indicators? How do other demographic characteristics (eg. race and age) factor in, if at all?
Prior Research
Riggs and Sethi (2020), in examining a quantified, indexed version of “walkability”, have determined that neighborhoods scoring high in this category do show increased multimodal/alternative transportation use (namely – walking, cycling, or transit). With more specificity, Watson et. al (2020) provide a national-level overview that confirms a strong link between high walkability scores and walking itself as a mode of commuting to work throughout the urban areas of the United States. Our research question seeks to confirm these findings through selected census and walkability study data.
Data
We will be using the following data sets:
- United States American Community Survey (2019)
- Centers for Disease Control and Prevention: PLACES U.S. Chronic Disease Indicators (2020)
- United States Environmental Protection Agency: National Walkability Index (2021)
Variables for Analysis
Continuous * Population Percentage with Current Mental Health Issues * Population Percentage with Obesity * Population Percentage with Current Asthma * Percentage of Mode of Transportation to Work * Walkability Index Score
Categorical * Majority White vs. Non-White * Majority Female vs. Not Female
Sample Definition
Our research will focus on the scale of census tracts in Massachusetts, with a look at adult characteristics and behaviors aggregated to the level of census tracts.
Load data
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.3 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidycensus)
library(readxl)
library(knitr)
library(sf)
## Linking to GEOS 3.8.1, GDAL 3.2.1, PROJ 7.2.1
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(ggthemes)
library(ggspatial)
library(jtools)
library(tigris)
## To enable
## caching of data, set `options(tigris_use_cache = TRUE)` in your R script or .Rprofile.
library(huxtable)
##
## Attaching package: 'huxtable'
## The following object is masked from 'package:dplyr':
##
## add_rownames
## The following object is masked from 'package:ggplot2':
##
## theme_grey
Mode of Transportation
Source: American Community Survey (2019)
- Mode of Transportation
acs_transport <- get_acs(geography = "tract",
year = 2019,
variables = c(transport_any = "B08301_001E",
transport_walk = "B08301_019E",
transport_car = "B08301_002E",
transport_public = "B08301_010E",
transport_bike = "B08301_018E",
transport_taxi = "B08301_016E",
transport_motorcycle = "B08301_017E",
transport_other = "B08301_020E"),
state = "MA",
output = "wide",
geometry = FALSE) %>%
mutate(pct_walk = transport_walk / transport_any) %>%
mutate(pct_car = transport_car / transport_any) %>%
mutate(pct_public = transport_public / transport_any) %>%
mutate(pct_bike = transport_bike / transport_any) %>%
mutate(pct_taxi = transport_taxi / transport_any) %>%
mutate(pct_motorcycle = transport_motorcycle / transport_any) %>%
mutate(pct_other = transport_other / transport_any)%>%
select(GEOID, pct_walk, pct_car, pct_public, pct_bike, pct_taxi, pct_motorcycle, pct_other)
Demographics
Source: American Community Survey (2019)
- Majority White vs. Non-White (categorical)
- Majority Female vs. Not Female (categorical)
race_majority <- get_acs(geography = "tract",
year = 2019,
variables = c( total_pop = "B02001_001",
white_pop = "B02001_002"),
state = "MA",
output = "wide",
geometry = FALSE) %>%
mutate(white_majority = case_when(white_popE / total_popE >= .5 ~ "Majority White",
white_popE / total_popE < .5 ~ "Non-Majority White",
TRUE ~ "unknown")) %>%
select(GEOID,white_majority)
## Getting data from the 2015-2019 5-year ACS
sex_majority <- get_acs(geography = "tract",
year = 2019,
variables = c( total_pop = "B01001_001",
female_pop = "B01001_026"),
state = "MA",
output = "wide",
geometry = FALSE) %>%
mutate(female_majority = case_when(female_popE / total_popE >= .5 ~ "Female Majority",
female_popE / total_popE < .5 ~ "Non-Female Majority",
TRUE ~ "unknown")) %>%
select(GEOID,female_majority)
## Getting data from the 2015-2019 5-year ACS
U.S. Chronic Disease Indicators
Source: United States Centers for Disease Control and Prevention: PLACES U.S. Chronic Disease Indicators
- Population Percentage with Current Mental Health Issues (continuous)
- Population Percentage with Obesity (continuous)
- Population Percentage with Current Asthma (continuous)
Mental_Health <- read_csv('data/PLACES__Local_Data_for_Better_Health__Census_Tract_Data_2020_release.csv') %>%
filter(Year == 2018) %>%
filter(Short_Question_Text == "Mental Health") %>%
filter(StateAbbr == "MA") %>%
rename(GEOID = LocationID) %>%
group_by(GEOID) %>%
rename(pct_mental = Data_Value) %>%
select(GEOID, pct_mental, Data_Value_Unit)
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 2024865 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): StateAbbr, StateDesc, CountyName, CountyFIPS, LocationName, DataSo...
## dbl (5): Year, Data_Value, Low_Confidence_Limit, High_Confidence_Limit, Tot...
## lgl (2): Data_Value_Footnote_Symbol, Data_Value_Footnote
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Asthma <- read_csv('data/PLACES__Local_Data_for_Better_Health__Census_Tract_Data_2020_release.csv') %>%
filter(Year == 2018) %>%
filter(Short_Question_Text == "Current Asthma") %>%
filter(StateAbbr == "MA") %>%
rename(GEOID = LocationID) %>%
group_by(GEOID) %>%
rename(pct_asthma = Data_Value) %>%
select(GEOID, pct_asthma, Data_Value_Unit)
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 2024865 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): StateAbbr, StateDesc, CountyName, CountyFIPS, LocationName, DataSo...
## dbl (5): Year, Data_Value, Low_Confidence_Limit, High_Confidence_Limit, Tot...
## lgl (2): Data_Value_Footnote_Symbol, Data_Value_Footnote
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Obesity <- read_csv('data/PLACES__Local_Data_for_Better_Health__Census_Tract_Data_2020_release.csv') %>%
filter(Year == 2018) %>%
filter(Short_Question_Text == "Obesity") %>%
filter(StateAbbr == "MA") %>%
rename(GEOID = LocationID) %>%
group_by(GEOID) %>%
rename(pct_obesity = Data_Value) %>%
select(GEOID, pct_obesity, Data_Value_Unit)
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 2024865 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): StateAbbr, StateDesc, CountyName, CountyFIPS, LocationName, DataSo...
## dbl (5): Year, Data_Value, Low_Confidence_Limit, High_Confidence_Limit, Tot...
## lgl (2): Data_Value_Footnote_Symbol, Data_Value_Footnote
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Walkability
Source: United States Environmental Protection Agency: National Walkability Index
- Walkability Index Score
# The following code was used to created the original "Walkability" file, it was then re-loaded as a .shp file
# This was done to reduce the file size and because the .shp file doesn't except "avg_w_ind" as a column name (too long apparently).
# Walkability <- st_read(dsn = "Natl_WI.gdb") %>%
# filter(STATEFP == "25") %>%
# select(GEOID10, NatWalkInd, TotPop) %>%
# mutate(tract = substr(GEOID10, 1, 11)) %>%
# st_set_geometry(NULL) %>%
# group_by(tract) %>%
# summarise(avg_w_ind = weighted.mean(NatWalkInd, TotPop)) %>%
# rename(GEOID = tract)
#
# st_write(Walkability, "data/Walkability_file.shp")
# print(Walkability)
Walkability <- st_read(dsn = "data/Walkability_file.dbf")
## Reading layer `Walkability_file' from data source
## `/Users/mc/Documents/2_ SCHOOL/2_ Harvard/1_ Classes/S1/6_ Quant/Assignment 5/quant_assignment_1/data/Walkability_file.dbf'
## using driver `ESRI Shapefile'
## Warning: no simple feature geometries present: returning a data.frame or tbl_df
Assembling the datasets
data <- left_join(acs_transport, Walkability) %>%
left_join(Asthma) %>%
left_join(Obesity) %>%
left_join(Mental_Health) %>%
left_join(race_majority) %>%
left_join(sex_majority) %>%
select(GEOID, avg_w_ind, pct_walk, pct_car, pct_public, pct_bike, pct_taxi, pct_motorcycle, pct_other, pct_asthma, pct_mental, pct_obesity, white_majority, female_majority)
kable(head(data))
| GEOID | avg_w_ind | pct_walk | pct_car | pct_public | pct_bike | pct_taxi | pct_motorcycle | pct_other | pct_asthma | pct_mental | pct_obesity | white_majority | female_majority |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25027731001 | 11.333333 | 0.0052493 | 0.9755031 | 0.0131234 | 0.0000000 | 0 | 0 | 0.0000000 | 10.1 | 14.0 | 30.8 | Majority White | Female Majority |
| 25027709501 | 7.333333 | 0.0000000 | 0.9022624 | 0.0199095 | 0.0000000 | 0 | 0 | 0.0000000 | 9.7 | 12.6 | 30.7 | Majority White | Female Majority |
| 25027709502 | 9.525729 | 0.0102145 | 0.9417773 | 0.0108955 | 0.0000000 | 0 | 0 | 0.0000000 | 10.0 | 12.7 | 30.0 | Majority White | Female Majority |
| 25027721101 | 3.833333 | 0.0078125 | 0.9101562 | 0.0029297 | 0.0029297 | 0 | 0 | 0.0146484 | 10.0 | 13.5 | 31.0 | Majority White | Female Majority |
| 25027721102 | 5.537969 | 0.0000000 | 0.9310201 | 0.0000000 | 0.0000000 | 0 | 0 | 0.0000000 | 9.6 | 12.4 | 29.6 | Majority White | Non-Female Majority |
| 25027731002 | 12.806539 | 0.0000000 | 0.9193947 | 0.0648548 | 0.0000000 | 0 | 0 | 0.0000000 | 11.0 | 16.1 | 35.4 | Majority White | Female Majority |
Initial Analysis of Continuous Data
asthma_t_test <- t.test(data$pct_asthma)
obesity_t_test <- t.test(data$pct_obesity)
mental_t_test <- t.test(data$pct_mental)
walkability_t_test <- t.test(data$avg_w_ind)
mwalk_t_test <- t.test(data$pct_walk)
mcar_t_test <- t.test(data$pct_car)
mpublic_t_test <- t.test(data$pct_public)
mbike_t_test <- t.test(data$pct_bike)
mtaxi_t_test <- t.test(data$pct_taxi)
mmotorcycle_t_test <- t.test(data$pct_motorcycle)
mother_t_test <- t.test(data$pct_other)
asthma_quartiles <- quantile(data$pct_asthma, na.rm = TRUE)
obesity_quartiles <- quantile(data$pct_obesity, na.rm = TRUE)
mental_quartiles <- quantile(data$pct_mental, na.rm = TRUE)
walk_quartiles <- quantile(data$pct_walk, na.rm = TRUE)
bike_quartiles <- quantile(data$pct_bike, na.rm = TRUE)
car_quartiles <- quantile(data$pct_car, na.rm = TRUE)
taxi_quartiles <- quantile(data$pct_taxi, na.rm = TRUE)
public_quartiles <- quantile(data$pct_public, na.rm = TRUE)
motorcycle_quartiles <- quantile(data$pct_motorcycle, na.rm = TRUE)
other_quartiles <- quantile(data$pct_other, na.rm = TRUE)
asthma_st_dev <- sd(data$pct_asthma, na.rm = TRUE)
obesity_st_dev <- sd(data$pct_obesity, na.rm = TRUE)
mental_st_dev <- sd(data$pct_mental, na.rm = TRUE)
walk_st_dev <- sd(data$pct_walk, na.rm = TRUE)
bike_st_dev <- sd(data$pct_bike, na.rm = TRUE)
car_st_dev <- sd(data$pct_car, na.rm = TRUE)
taxi_st_dev <- sd(data$pct_taxi, na.rm = TRUE)
public_st_dev <- sd(data$pct_public, na.rm = TRUE)
motorcycle_st_dev <- sd(data$pct_motorcycle, na.rm = TRUE)
other_st_dev <- sd(data$pct_other, na.rm = TRUE)
asthma_hist <- ggplot(data) +
geom_histogram(aes(x = pct_asthma),
bins = 30)
obesity_hist <- ggplot(data) +
geom_histogram(aes(x = pct_obesity),
bins = 30) +
scale_x_continuous(trans = "log")
mental_hist <- ggplot(data) +
geom_histogram(aes(x = pct_mental),
bins = 30)
asthma_hist
## Warning: Removed 15 rows containing non-finite values (stat_bin).
mode_summary <- tibble(
Variable = c("Percentage of Workers Who Commute by Walking",
"Percentage of Workers Who Commute by Bike",
"Percentage of Workers Who Commute by Car",
"Percentage of Workers Who Commute by Public Transit",
"Percentage of Workers Who Commute by Taxi",
"Percentage of Workers Who Commute by Motorcycle",
"Percentage of Workers Who Commute by Other Means"),
`Sample mean` = c((mwalk_t_test$estimate*100),
(mbike_t_test$estimate*100),
(mcar_t_test$estimate*100),
(mpublic_t_test$estimate*100),
(mtaxi_t_test$estimate*100),
(mmotorcycle_t_test$estimate*100),
(mother_t_test$estimate*100)),
`Population mean (95% confidence) - low` =
c((mwalk_t_test$conf.int[1]*100),
(mbike_t_test$conf.int[1]*100),
(mcar_t_test$conf.int[1]*100),
(mpublic_t_test$conf.int[1]*100),
(mtaxi_t_test$conf.int[1]*100),
(mmotorcycle_t_test$conf.int[1]*100),
(mother_t_test$conf.int[1]*100)),
`Population mean (95% confidence) - high` =
c((mwalk_t_test$conf.int[2]*100),
(mbike_t_test$conf.int[2]*100),
(mcar_t_test$conf.int[2]*100),
(mpublic_t_test$conf.int[2]*100),
(mtaxi_t_test$conf.int[2]*100),
(mmotorcycle_t_test$conf.int[2]*100),
(mother_t_test$conf.int[2])*100),
Median = c((mwalk_t_test$conf.int[3]*100),
(mbike_t_test$conf.int[3]*100),
(mcar_t_test$conf.int[3]*100),
(mpublic_t_test$conf.int[3]*100),
(mtaxi_t_test$conf.int[3]*100),
(mmotorcycle_t_test$conf.int[3]*100),
(mother_t_test$conf.int[3]*100)),
`Interquartile range` = c(((walk_quartiles[4] - walk_quartiles[2])*100),
((bike_quartiles[4] - bike_quartiles[2])*100),
((car_quartiles[4] - car_quartiles[2])*100),
((public_quartiles[4] - public_quartiles[2])*100),
((taxi_quartiles[4] - taxi_quartiles[2])*100),
((motorcycle_quartiles[4] - motorcycle_quartiles[2])*100),
((other_quartiles[4] - other_quartiles[2])*100)),
`Standard deviation` = c((walk_st_dev*100), (bike_st_dev*100), (car_st_dev*100), (taxi_st_dev*100), (public_st_dev*100), (motorcycle_st_dev*100), (other_st_dev*100)))
kable(mode_summary, digits = 2)
| Variable | Sample mean | Population mean (95% confidence) - low | Population mean (95% confidence) - high | Median | Interquartile range | Standard deviation |
|---|---|---|---|---|---|---|
| Percentage of Workers Who Commute by Walking | 5.58 | 5.07 | 6.09 | NA | 4.63 | 10.02 |
| Percentage of Workers Who Commute by Bike | 0.89 | 0.78 | 0.99 | NA | 0.72 | 2.04 |
| Percentage of Workers Who Commute by Car | 76.42 | 75.39 | 77.45 | NA | 20.37 | 20.05 |
| Percentage of Workers Who Commute by Public Transit | 10.50 | 9.83 | 11.18 | NA | 12.95 | 0.69 |
| Percentage of Workers Who Commute by Taxi | 0.29 | 0.25 | 0.32 | NA | 0.08 | 13.17 |
| Percentage of Workers Who Commute by Motorcycle | 0.08 | 0.06 | 0.10 | NA | 0.00 | 0.41 |
| Percentage of Workers Who Commute by Other Means | 1.03 | 0.95 | 1.12 | NA | 1.42 | 1.58 |
pretty_asthma_hist <- asthma_hist +
theme_bw() +
scale_x_continuous(name = "Percentage of residents with asthma") +
scale_y_continuous(name = "Number of tracts") +
theme(axis.text.x = element_text(angle = 90))
pretty_obesity_hist <- obesity_hist +
theme_bw() +
scale_x_continuous(name = "Percentage of residents with obesity") +
scale_y_continuous(name = "Number of tracts") +
theme(axis.text.x = element_text(angle = 90))
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
pretty_mental_hist <- mental_hist +
theme_bw() +
scale_x_continuous(name = "Percentage of residents with poor mental health") +
scale_y_continuous(name = "Number of tracts") +
theme(axis.text.x = element_text(angle = 90))
grid.arrange(pretty_asthma_hist, pretty_obesity_hist, pretty_mental_hist,
ncol = 3)
## Warning: Removed 15 rows containing non-finite values (stat_bin).
## Warning: Removed 15 rows containing non-finite values (stat_bin).
## Warning: Removed 15 rows containing non-finite values (stat_bin).
Initial Analysis of Categorical Data
pct_white <- t.test(data$white_majority == "Majority White")
pct_nonwhite <- t.test(data$white_majority == "Non-Majority White")
pct_female <- t.test(data$female_majority == "Female Majority")
pct_notfemale <- t.test(data$female_majority == "Non-Female Majority")
cat_summary_race <- tibble(`Majority Race` =
c("White",
"Not White"),
`Sample proportion` =
c(pct_white$estimate * 100,
pct_nonwhite$estimate *100),
`95-percent confidence - low` =
c(pct_white$conf.int[1] * 100,
pct_nonwhite$conf.int[1] * 100),
`95-percent confidence - high` =
c(pct_white$conf.int[2] * 100,
pct_nonwhite$conf.int[2] * 100))
kable(cat_summary_race, digits = 2)
| Majority Race | Sample proportion | 95-percent confidence - low | 95-percent confidence - high |
|---|---|---|---|
| White | 89.51 | 87.95 | 91.08 |
| Not White | 9.54 | 8.04 | 11.04 |
cat_summary_sex <- tibble(`Majority Sex` =
c("Female",
"Not Female"),
`Sample proportion` =
c(pct_female$estimate * 100,
pct_notfemale$estimate *100),
`95-percent confidence - low` =
c(pct_female$conf.int[1] * 100,
pct_notfemale$conf.int[1] * 100),
`95-percent confidence - high` =
c(pct_female$conf.int[2] * 100,
pct_notfemale$conf.int[2] * 100))
kable(cat_summary_sex, digits = 2)
| Majority Sex | Sample proportion | 95-percent confidence - low | 95-percent confidence - high |
|---|---|---|---|
| Female | 69.08 | 66.72 | 71.44 |
| Not Female | 29.97 | 27.63 | 32.31 |
Assignment - Bivariate regression
In trying to understand the importance of walkability to health and commuting patterns, we ran a regression analysis for correlation using ‘cor.test’ and ‘age_model’, then plotted the data on a graph. Our dependent variable is “walkability” provided through a Massachusetts Census Tracts’ walkability score. Our independent variables are divided into two fields: health metrics and commuting choice. For health metrics, we are using the percent of the population afflicted with asthma, obesity, or mental health issues respectively. For commuting choice, we’re using the percent of the population that commutes via walking, biking, public transit, or car respectively.
Walkability and Health Metrics
Walkability and Asthma
cor.test(~ avg_w_ind + pct_asthma, data = data)
##
## Pearson's product-moment correlation
##
## data: avg_w_ind and pct_asthma
## t = 11.883, df = 1461, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2494104 0.3428982
## sample estimates:
## cor
## 0.2968655
age_model <- lm(avg_w_ind ~ pct_asthma, data = data)
summary(age_model)
##
## Call:
## lm(formula = avg_w_ind ~ pct_asthma, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8997 -3.1243 0.3704 2.9462 8.7244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4426 0.7741 3.155 0.00164 **
## pct_asthma 0.8722 0.0734 11.883 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.723 on 1461 degrees of freedom
## (15 observations deleted due to missingness)
## Multiple R-squared: 0.08813, Adjusted R-squared: 0.0875
## F-statistic: 141.2 on 1 and 1461 DF, p-value: < 2.2e-16
ggplot(data) +
geom_point(aes(x = avg_w_ind, y = pct_asthma)) +
geom_smooth(aes(x = avg_w_ind, y = pct_asthma), color = 'red', method = 'lm', se = FALSE) +
labs(x = "Walkability Score",
y = "Percent of Pop. with Asthma",
title = "Regression: Walkability Score and Asthma Rates, Correlation: 0.29")
For asthma, the 95-percent confidence interval for the correlation does not include zero - all values in the interval are positive. Additionally, this finding is supported by the regression, with an R-squared value of 0.08 and a p-value of less than 0.05. This means we can say with 95-percent confidence that higher asthma rates are associated with more walkable census tracts and that it is statistically significant.
Walkability and Obesity
cor.test(~ avg_w_ind + pct_obesity, data = data)
##
## Pearson's product-moment correlation
##
## data: avg_w_ind and pct_obesity
## t = -0.48809, df = 1461, p-value = 0.6256
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06397625 0.03850640
## sample estimates:
## cor
## -0.01276846
age_model <- lm(avg_w_ind ~ pct_obesity, data = data)
summary(age_model)
##
## Call:
## lm(formula = avg_w_ind ~ pct_obesity, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8134 -3.5706 0.9706 3.2836 7.4787
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.798894 0.484225 24.367 <2e-16 ***
## pct_obesity -0.008676 0.017775 -0.488 0.626
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.899 on 1461 degrees of freedom
## (15 observations deleted due to missingness)
## Multiple R-squared: 0.000163, Adjusted R-squared: -0.0005213
## F-statistic: 0.2382 on 1 and 1461 DF, p-value: 0.6256
ggplot(data) +
geom_point(aes(x = avg_w_ind, y = pct_obesity)) +
geom_smooth(aes(x = avg_w_ind, y = pct_obesity), color = 'red', method = 'lm', se = FALSE) +
labs(x = "Walkability Score",
y = "Percent of Pop. with Obesity",
title = "Regression: Walkability Score and Obesity Rates, Correlation: -0.01")
For obesity, the 95-percent confidence interval for the correlation includes zero, meaning the direction of the correlation (positive or negative) is uncertain. Additionally, this finding is supported by the regression, with an R-squared value of -0.00052 and a p-value of 0.625. This means we cannot inform with with confidence whether obesity and walkable census tracts are correlated and that this finding’s significance is indeterminable with the present evidence.
Walkability and Mental Health
cor.test(~ avg_w_ind + pct_mental, data = data)
##
## Pearson's product-moment correlation
##
## data: avg_w_ind and pct_mental
## t = 13.737, df = 1461, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2920163 0.3828188
## sample estimates:
## cor
## 0.3382045
age_model <- lm(avg_w_ind ~ pct_mental, data = data)
summary(age_model)
##
## Call:
## lm(formula = avg_w_ind ~ pct_mental, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.7242 -2.9927 0.3731 2.9256 8.3049
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.08786 0.41030 14.84 <2e-16 ***
## pct_mental 0.39927 0.02907 13.74 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.669 on 1461 degrees of freedom
## (15 observations deleted due to missingness)
## Multiple R-squared: 0.1144, Adjusted R-squared: 0.1138
## F-statistic: 188.7 on 1 and 1461 DF, p-value: < 2.2e-16
ggplot(data) +
geom_point(aes(x = avg_w_ind, y = pct_mental)) +
geom_smooth(aes(x = avg_w_ind, y = pct_mental), color = 'red', method = 'lm', se = FALSE) +
labs(x = "Walkability Score",
y = "Percent of Pop. with Mental Health Issues",
title = "Regression: Walkability Score and Mental Health Rates, Correlation: 0.33")
For mental health, the 95-percent confidence interval for the correlation does not include zero - all values in the interval are positive. Additionally, this finding is supported by the regression, with an R-squared value of 0.11 and a p-value of less than 0.05. This means we can say with 95-percent confidence that higher incidence of mental health issues are associated with more walkable census tracts and that this finding is statistically significant.
Walkability and Commute Choice
Walkability and Commuting by Walking
cor.test(~ avg_w_ind + pct_walk, data = data)
##
## Pearson's product-moment correlation
##
## data: avg_w_ind and pct_walk
## t = 16.317, df = 1462, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3482749 0.4349887
## sample estimates:
## cor
## 0.3925036
age_model <- lm(avg_w_ind ~ pct_walk, data = data)
summary(age_model)
##
## Call:
## lm(formula = avg_w_ind ~ pct_walk, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.8213 -3.1892 0.6462 3.0897 7.3097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.7073 0.1075 99.60 <2e-16 ***
## pct_walk 15.2938 0.9373 16.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.594 on 1462 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.1541, Adjusted R-squared: 0.1535
## F-statistic: 266.3 on 1 and 1462 DF, p-value: < 2.2e-16
ggplot(data) +
geom_point(aes(x = avg_w_ind, y = pct_walk)) +
geom_smooth(aes(x = avg_w_ind, y = pct_walk), color = 'red', method = 'lm', se = FALSE) +
labs(x = "Walkability Score",
y = "Percent of Pop. Commuting Via Walking",
title = "Regression: Walkability Score and Walking Use, Correlation: 0.39")
For commuting by walking, the 95-percent confidence interval for the correlation does not include zero - all values in the interval are positive. Additionally, this finding is supported by the regression, with an R-squared value of 0.15 and a p-value of less than 0.05. This means we can say with 95-percent confidence that higher commuting rates by walking are associated with more walkable census tracts and that this finding is statistically significant.
Walkability and Commuting by Biking
cor.test(~ avg_w_ind + pct_bike, data = data)
##
## Pearson's product-moment correlation
##
## data: avg_w_ind and pct_bike
## t = 14.107, df = 1462, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3002402 0.3904558
## sample estimates:
## cor
## 0.346148
age_model <- lm(avg_w_ind ~ pct_bike, data = data)
summary(age_model)
##
## Call:
## lm(formula = avg_w_ind ~ pct_bike, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9704 -3.2022 0.5954 3.0827 7.7655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.9704 0.1045 104.94 <2e-16 ***
## pct_bike 66.4250 4.7085 14.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.666 on 1462 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.1198, Adjusted R-squared: 0.1192
## F-statistic: 199 on 1 and 1462 DF, p-value: < 2.2e-16
ggplot(data) +
geom_point(aes(x = avg_w_ind, y = pct_bike)) +
geom_smooth(aes(x = avg_w_ind, y = pct_bike), color = 'blue', method = 'lm', se = FALSE) +
labs(x = "Walkability Score",
y = "Percent of Pop. Commuting Via Biking",
title = "Regression: Walkability Score and Biking Use, Correlation: 0.34")
For commuting by biking, the 95-percent confidence interval for the correlation does not include zero - all values in the interval are positive. Additionally, this finding is supported by the regression, with an R-squared value of 0.11 and a p-value of less than 0.05. This means we can say with 95-percent confidence that higher commuting rates by biking are associated with more walkable census tracts and that this finding is statistically significant.
Walkability and Commuting by Public Transit
cor.test(~ avg_w_ind + pct_public, data = data)
##
## Pearson's product-moment correlation
##
## data: avg_w_ind and pct_public
## t = 19.814, df = 1462, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4187422 0.4995602
## sample estimates:
## cor
## 0.4601037
age_model <- lm(avg_w_ind ~ pct_public, data = data)
summary(age_model)
##
## Call:
## lm(formula = avg_w_ind ~ pct_public, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.7747 -2.8021 0.2858 2.7100 8.1044
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.1272 0.1160 87.31 <2e-16 ***
## pct_public 13.6476 0.6888 19.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.469 on 1462 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.2117, Adjusted R-squared: 0.2112
## F-statistic: 392.6 on 1 and 1462 DF, p-value: < 2.2e-16
ggplot(data) +
geom_point(aes(x = avg_w_ind, y = pct_public)) +
geom_smooth(aes(x = avg_w_ind, y = pct_public), color = 'green', method = 'lm', se = FALSE) +
labs(x = "Walkability Score",
y = "Percent of Pop. Commuting Via Transit",
title = "Regression: Walkability Score and Transit Use, Correlation: 0.46")
For commuting by transit, the 95-percent confidence interval for the correlation does not include zero - all values in the interval are positive. Additionally, this finding is supported by the regression, with an R-squared value of 0.21 and a p-value of less than 0.05. This means we can say with 95-percent confidence that higher commuting rates by transit are associated with more walkable census tracts and that this finding is statistically significant.
Walkability and Commuting by Car
cor.test(~ avg_w_ind + pct_car, data = data)
##
## Pearson's product-moment correlation
##
## data: avg_w_ind and pct_car
## t = -23.1, df = 1462, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5536691 -0.4785508
## sample estimates:
## cor
## -0.517105
age_model <- lm(avg_w_ind ~ pct_car, data = data)
summary(age_model)
##
## Call:
## lm(formula = avg_w_ind ~ pct_car, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.2611 -2.7227 0.1955 2.6107 8.0268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.2611 0.3446 55.89 <2e-16 ***
## pct_car -10.0765 0.4362 -23.10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.344 on 1462 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.2674, Adjusted R-squared: 0.2669
## F-statistic: 533.6 on 1 and 1462 DF, p-value: < 2.2e-16
ggplot(data) +
geom_point(aes(x = avg_w_ind, y = pct_car)) +
geom_smooth(aes(x = avg_w_ind, y = pct_car), color = 'orange', method = 'lm', se = FALSE) +
labs(x = "Walkability Score",
y = "Percent of Pop. Commuting Via Car",
title = "Regression: Walkability Score and Car Use, Correlation: -0.51")
For commuting by car, the 95-percent confidence interval for the correlation does not include zero - all values in the interval are negative. Additionally, this finding is supported by the regression, with an R-squared value of 0.26 and a p-value of less than 0.05. This means we can say with 95-percent confidence that lower commuting rates by car are associated with more walkable census tracts and that this finding is statistically significant.
Categorical variables
Sex_majority
sex_model <- lm(avg_w_ind ~ female_majority, data = data)
summary(sex_model)
##
## Call:
## lm(formula = avg_w_ind ~ female_majority, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.402 -3.602 0.950 3.292 7.702
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6292 0.1222 95.13 <2e-16 ***
## female_majorityNon-Female Majority -0.2267 0.2222 -1.02 0.308
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.906 on 1462 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.0007113, Adjusted R-squared: 2.781e-05
## F-statistic: 1.041 on 1 and 1462 DF, p-value: 0.3078
ggplot(data) +
geom_point(aes(x = avg_w_ind, y = female_majority)) +
geom_smooth(aes(x = avg_w_ind, y = female_majority), color = 'orange', method = 'lm', se = FALSE) +
labs(x = "Walkability Score",
y = "Percent of Pop. that identify as female",
title = "Walkability Score by Sex")
Race_majority
race_model <- lm(avg_w_ind ~ white_majority, data = data)
summary(race_model)
##
## Call:
## lm(formula = avg_w_ind ~ white_majority, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.9151 -3.2729 0.3703 3.1949 7.8655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.2386 0.1039 108.125 <2e-16 ***
## white_majorityNon-Majority White 3.3431 0.3349 9.981 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.781 on 1462 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.0638, Adjusted R-squared: 0.06316
## F-statistic: 99.63 on 1 and 1462 DF, p-value: < 2.2e-16
ggplot(data) +
geom_point(aes(x = avg_w_ind, y = white_majority)) +
geom_smooth(aes(x = avg_w_ind, y = white_majority), color = 'orange', method = 'lm', se = FALSE) +
labs(x = "Walkability Score",
y = "Percent of Pop. that identify as White",
title = "Walkability Score by Race")
Although the sex_category has is not statistically significant, however, it shows that census tracts with a higher female_majority increase walkability score for the census tract.
We can also see that Race_majority is statistically significant, census tracts with higher non-majority white race increase the walkability score for the census tract.
Assignment 4 - Multivariate regression
full_model <- lm(avg_w_ind ~ pct_asthma + pct_obesity + pct_mental + pct_walk + pct_bike + pct_public + pct_car + pct_taxi + pct_motorcycle + pct_other + white_majority, data)
summary(full_model)
##
## Call:
## lm(formula = avg_w_ind ~ pct_asthma + pct_obesity + pct_mental +
## pct_walk + pct_bike + pct_public + pct_car + pct_taxi + pct_motorcycle +
## pct_other + white_majority, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.7785 -2.3036 -0.0372 2.2156 7.9100
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.614270 2.211977 -0.730 0.465639
## pct_asthma 0.003785 0.161408 0.023 0.981297
## pct_obesity -0.041913 0.025393 -1.651 0.099038 .
## pct_mental 0.332207 0.068165 4.874 1.22e-06 ***
## pct_walk 13.791473 2.694690 5.118 3.50e-07 ***
## pct_bike 38.529980 5.115190 7.532 8.71e-14 ***
## pct_public 17.431580 2.233629 7.804 1.14e-14 ***
## pct_car 8.096956 2.244250 3.608 0.000319 ***
## pct_taxi 83.076529 12.046489 6.896 7.94e-12 ***
## pct_motorcycle 22.946219 18.966155 1.210 0.226533
## pct_other 30.050899 5.836046 5.149 2.98e-07 ***
## white_majorityNon-Majority White 0.085221 0.319473 0.267 0.789696
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.947 on 1451 degrees of freedom
## (15 observations deleted due to missingness)
## Multiple R-squared: 0.4328, Adjusted R-squared: 0.4285
## F-statistic: 100.6 on 11 and 1451 DF, p-value: < 2.2e-16
In our bivariate regression, walkability scores increases in census tracts that have higher numbers of the following medical conditions: Asthma and mental health issues, also increases with a higher percentage of the population who uses the following modes of transit: Walk, Bike, Public Transit. Walkability scores also increase with an increased population who identifies as non-majority white.
When we control for all medical conditions variables (asthma, obesity, mental issues) and all different modes of transportation variables (walk, bike, public transit, car), we found that the percentage of population that commutes by bikes contributes by 33.52615 positively to the walkability score of the census tract, making it the most significant category. Similarly, the population that uses public transit contributes to the increase of walkability in a census tract by 11.57760. Regarding medical conditions, populations with high mental issues contribute significantly to the walkability of the census tract by 0.476.
Although the female_majorty variant was not significant as a bivariant, combined with other variants in the multivariant regression, it increases in significance. Non_female majority census tracts decrease the walkability score -0.63827
This explains that census tracts with access to public transit, walking routes, and bike paths, with population, that with majority non-white race, female majority, that report a higher number of mental issues and asthma are the most significant variants to increase walkability scores.
Overall, our model explains about 40 percent of the variation in walkability-level can be explained with these variables.
Assignment 5 - Transformations
centered_data <- data %>%
mutate (avg_w_ind = avg_w_ind - mean(avg_w_ind, na.rm=TRUE),
pct_walk = pct_walk - mean(pct_walk, na.rm=TRUE),
pct_car = pct_car - mean(pct_car, na.rm=TRUE),
pct_public = pct_public - mean(pct_public, na.rm=TRUE),
pct_bike = pct_bike - mean(pct_bike, na.rm=TRUE),
pct_taxi = pct_taxi - mean(pct_taxi, na.rm=TRUE),
pct_motorcycle = pct_motorcycle - mean(pct_motorcycle, na.rm=TRUE),
pct_other = pct_other - mean(pct_other, na.rm=TRUE),
pct_asthma = pct_asthma - mean(pct_asthma, na.rm=TRUE),
pct_mental = pct_mental - mean(pct_mental, na.rm=TRUE),
pct_obesity = pct_obesity - mean(pct_obesity, na.rm=TRUE))
centered_model <- lm(avg_w_ind ~ pct_walk + pct_car + pct_public + pct_bike + pct_taxi + pct_motorcycle + pct_other + pct_asthma + pct_mental + pct_obesity + white_majority, centered_data)
export_summs(full_model, centered_model,
error_format = "(p = {p.value})",
error_pos = "same",
model.names = c("Initial", "Centered"))
| Initial | Centered | |
|---|---|---|
| (Intercept) | -1.61 (p = 0.47) | 0.00 (p = 0.96) |
| pct_asthma | 0.00 (p = 0.98) | 0.00 (p = 0.98) |
| pct_obesity | -0.04 (p = 0.10) | -0.04 (p = 0.10) |
| pct_mental | 0.33 *** (p = 0.00) | 0.33 *** (p = 0.00) |
| pct_walk | 13.79 *** (p = 0.00) | 13.79 *** (p = 0.00) |
| pct_bike | 38.53 *** (p = 0.00) | 38.53 *** (p = 0.00) |
| pct_public | 17.43 *** (p = 0.00) | 17.43 *** (p = 0.00) |
| pct_car | 8.10 *** (p = 0.00) | 8.10 *** (p = 0.00) |
| pct_taxi | 83.08 *** (p = 0.00) | 83.08 *** (p = 0.00) |
| pct_motorcycle | 22.95 (p = 0.23) | 22.95 (p = 0.23) |
| pct_other | 30.05 *** (p = 0.00) | 30.05 *** (p = 0.00) |
| white_majorityNon-Majority White | 0.09 (p = 0.79) | 0.09 (p = 0.79) |
| N | 1463 | 1463 |
| R2 | 0.43 | 0.43 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | ||
Analysis
In our multi-variate regression, we noticed that our intercept was a negative number. We found this interesting for interpretation because the regression table as originally estimated tells us that if all other variables (means of commute, certain health indicators, and ) are held at zero, we would expect a negative Walkability Index Score. Given it’s not possible to have a negative Walkability Index score, we thought it would be valuable to mean center our continuous variables in order to see how the intercept value might shift if all variables were end at their mean value.
What we found in the process of mean centering our data did very little to improve the overall analysis of our regression table. With a p-value of .96, the mean-centered intercept was statistically insignificant. Even if it had produced a significant p-value, the estimated walkability score of 0 does not offer any new insight as once again this isn’t a possible score on the Walkability Index (1-20). As expected, all other estimated coefficient values stayed constant despite the mean-centering. Ultimately, we prefer this second model, as it shows that mean-centering has been attempted (as opposed to not addressing the negative intercept value in the non-mean centered multi-variate regression) but we acknowledge this does not offer any new insight.
Resources
Centers for Disease Control and Prevention, “PLACES: Local Data for Better Health, Census Tract Data 2020 release”. 2020. https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-Census-Tract-D/cwsq-ngmh
United States Census Bureau. American Community Survey, 5-year estimates. 2019.
Watson et al., “Associations between the National Walkability Index and walking among US Adults - National Health Interview Survey, 2015”. 2020. https://pubmed.ncbi.nlm.nih.gov/32389677/
Riggs, William Warren and Suresh Andrew Sethi, “Multimodal travel behaviour, walkability indices, and social mobility: how neighbourhood walkability, income and household characteristics guide walking, biking & transit decisions”. 2020. https://www.researchgate.net/publication/344647424_Multimodal_travel_behaviour_walkability_indices_and_social_mobility_how_neighbourhood_walkability_income_and_household_characteristics_guide_walking_biking_transit_decisions